Lab 5: Wide and Deep Networks
By: Lawrence Lim and Matthew Grover
https://www.kaggle.com/datasets/dipeshkhemani/airbnb-cleaned-europe-dataset
The provided dataset contains information on Airbnb listings in various cities across Europe, such as Amsterdam, Barcelona, Paris, and Rome, among others. The data includes various attributes of each listing, such as its city, its nightly price, the type of booking day (weekday or weekend), its room type, its person capacity, whether its host is a Superhost, its cleanliness and guest-satisfaction ratings, its number of bedrooms, and its distances to the city center and the nearest metro station. The purpose of this dataset is to allow users to analyze and explore various aspects of the European Airbnb market, such as pricing trends, room-type popularity, host characteristics, and more. This data can be used by researchers, analysts, and other stakeholders to gain insights into the European vacation-rental industry and to inform decision-making in areas such as tourism planning, real estate investment, and hospitality management.
The provided dataset can be used to develop and train Wide and Deep Network Architectures for various applications, such as price prediction, demand forecasting, and recommender systems. For example, a wide-and-deep model could be trained using the attributes of each Airbnb listing as input features, such as the listing's location, room type, and host characteristics, as well as its ratings and proximity indices. The output of the model could be the predicted price per night or the predicted number of bookings for each listing. The model could also be used to make personalized recommendations to users based on their preferences, such as recommending listings with room types or neighborhoods similar to those they have previously booked. The use of both wide and deep layers in the network allows the model to capture both low-level (memorized) and high-level (generalized) features of the data, resulting in more accurate and effective predictions and recommendations.
# Attribute                   | Data Type | Description
# ----------------------------|-----------|-----------------------------------------------------------------
# City                        | Text      | The name of the European city where the Airbnb accommodation is located.
# Price                       | Numeric   | The nightly price of the Airbnb accommodation in Euros.
# Day                         | Date      | The date when the Airbnb accommodation was booked, in YYYY-MM-DD format.
# Room_Type                   | Text      | The type of Airbnb accommodation (e.g., entire apartment, private room, shared room).
# Shared_Room                 | Binary    | A binary indicator (0/1) that indicates whether the Airbnb accommodation is a shared room.
# Private_Room                | Binary    | A binary indicator (0/1) that indicates whether the Airbnb accommodation is a private room.
# Person_Capacity             | Numeric   | The maximum number of people who can stay in the Airbnb accommodation.
# Superhost                   | Binary    | A binary indicator (0/1) that indicates whether the Airbnb host is a Superhost.
# Multiple_Rooms_Business     | Binary    | A binary indicator (0/1) that indicates whether the Airbnb accommodation is a multiple-room or business listing.
# Cleanliness_Rating          | Numeric   | A numerical rating (0-10) indicating the cleanliness of the Airbnb accommodation.
# Guest_Satisfaction          | Numeric   | A numerical rating (0-10) indicating the satisfaction level of previous guests.
# Bedrooms                    | Numeric   | The number of bedrooms in the Airbnb accommodation.
# City_Center (km)            | Numeric   | The distance in kilometers from the Airbnb accommodation to the city center.
# Metro_Distance_(km)         | Numeric   | The distance in kilometers from the Airbnb accommodation to the nearest metro station.
# Attraction_Index            | Numeric   | A numerical rating (0-100) indicating the proximity of the Airbnb accommodation to tourist attractions.
# Normalized_Attraction_Index | Numeric   | The Attraction_Index score normalized between 0 and 100.
# Restaurant_Index            | Numeric   | A numerical rating (0-100) indicating the proximity of the Airbnb accommodation to restaurants.
# Normalized_Restaurant_Index | Numeric   | The Restaurant_Index score normalized between 0 and 100.
import pandas as pd

df = pd.read_csv("Aemf1.csv")
df

# Looking at the number of options for each non-numeric attribute
city_list = df['City'].unique()
print(city_list)

day_list = df['Day'].unique()
print(day_list)

room_list = df['Room Type'].unique()
print(room_list)

shared_list = df['Shared Room'].unique()
print(shared_list)

private_list = df['Private Room'].unique()
print(private_list)

superhost_list = df['Superhost'].unique()
print(superhost_list)
['Amsterdam' 'Athens' 'Barcelona' 'Berlin' 'Budapest' 'Lisbon' 'Paris'
 'Rome' 'Vienna']
['Weekday' 'Weekend']
['Private room' 'Entire home/apt' 'Shared room']
[False True]
[ True False]
[False True]
Encoding Features:
The first step is to properly encode the categorical features, as subsequent steps like LASSO regression require numeric features only. Starting with the City attribute, we see above that there are a total of 9 cities included in the analysis. One-hot encoding in this situation might seem inadvisable, as it will greatly increase the dimensionality of our dataset by adding 9 new binary variables. However, integer encoding is not a good option either, considering that the cities do not have any natural order: integer encoding may introduce an arbitrary ordinal relationship between the categories that does not reflect their true relationship. Through dimensionality reduction in later steps, we can reduce the dimensionality to offset this gain.
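The contrast between integer and one-hot encoding can be seen on a small illustrative example (the city names here are arbitrary and not tied to our dataset):

```python
import pandas as pd

cities = pd.Series(["Amsterdam", "Paris", "Rome", "Paris"])

# Integer encoding: assigns arbitrary codes in order of appearance,
# implying Amsterdam < Paris < Rome even though no such order exists
codes, uniques = pd.factorize(cities)
print(codes)  # [0 1 2 1]

# One-hot encoding: one binary column per city, no implied order
dummies = pd.get_dummies(cities)
print(dummies.columns.tolist())
```

A linear model fed the integer codes would treat "Rome" as twice "Paris", which is meaningless; the one-hot columns avoid that at the cost of extra dimensions.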
# One-hot encode the 'City' feature using pandas' get_dummies() function
city_dummies = pd.get_dummies(df['City'])
# Concatenate the one-hot encoded features to the original DataFrame
df = pd.concat([df, city_dummies], axis=1)
# Print the resulting DataFrame
df.head()
Regarding the Day, Shared Room, Private Room, and Superhost features, we see that each feature has only two options. This means we can represent these attributes with binary numbers, through either binary encoding or one-hot encoding. Binary encoding would simply keep the single attribute but convert true to '1' and false to '0'. However, in the case of the Day attribute, binary encoding may not be best: it is difficult for a reader to determine whether 1 means weekend or weekday, as neither value is naturally associated with true or false. Furthermore, it is more complicated to understand how a particular value of the Day variable relates to the target variable (Price) without splitting the two potential scenarios into individual columns. Therefore, I will one-hot encode the Day attribute and binary encode Shared Room, Private Room, and Superhost.
# Use one-hot encoding to represent the 'Day' feature
day_dummies = pd.get_dummies(df['Day'], prefix='Day')
# Concatenate the one-hot encoded features to the original DataFrame
df = pd.concat([df, day_dummies], axis=1)
# Print the resulting DataFrame
df.head()
# Use binary encoding to represent the 'Shared_Room' feature
df['Shared_Room_binary'] = df['Shared Room'].astype(int)
# Print the resulting DataFrame
df.head()
# Use binary encoding to represent the 'Private_Room' feature
df['Private_Room_binary'] = df['Private Room'].astype(int)
# Print the resulting DataFrame
df.head()
# Use binary encoding to represent the 'Superhost' feature
df['Superhost_binary'] = df['Superhost'].astype(int)
# Print the resulting DataFrame
df.head()
Finally, regarding the Room Type attribute: to prevent duplicated columns, it may be best to delete it and merely create one additional column called entire_home_appt_rent. Thus, in effect, we will have one-hot encoded the Room Type attribute.
df['entire_home_appt_rent'] = (df['Shared Room'] == 0) & (df['Private Room'] == 0)
df['entire_home_appt_rent'] = df['entire_home_appt_rent'].astype(int)
df.head()
Now, I will delete the old columns.
df_copy = df.copy()  # making a deep copy in case we need to refer back

df = df.drop(['City'], axis=1)
df = df.drop(['Day'], axis=1)
df = df.drop(['Room Type'], axis=1)
df = df.drop(['Shared Room'], axis=1)
df = df.drop(['Private Room'], axis=1)
df = df.drop(['Superhost'], axis=1)
df = df.drop(['Attraction Index'], axis=1)
df = df.drop(['Restraunt Index'], axis=1)
print(df.columns)
Index(['Price', 'Person Capacity', 'Multiple Rooms', 'Business',
'Cleanliness Rating', 'Guest Satisfaction', 'Bedrooms',
'City Center (km)', 'Metro Distance (km)',
'Normalised Attraction Index', 'Normalised Restraunt Index',
'Amsterdam', 'Athens', 'Barcelona', 'Berlin', 'Budapest', 'Lisbon',
'Paris', 'Rome', 'Vienna', 'Day_Weekday', 'Day_Weekend',
'Shared_Room_binary', 'Private_Room_binary', 'Superhost_binary',
'entire_home_appt_rent'],
dtype='object')
Removing Variables: Ridge Regression and LASSO
In order to further determine which attributes to keep and which to remove, we have decided to utilize ridge regression as a means of quantifying the importance of each of the variables. We can use ridge regression for feature selection indirectly, by identifying the most important features based on the magnitude of their coefficients after regularization. Because the features are standardized before fitting, each coefficient's magnitude refers to the amount by which the predicted price would change if the particular attribute were to increase by one standard deviation.
Ridge Regression:
import pandas as pd
import statsmodels.api as sm
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
# Read in the data
data = df
# Separate the target variable (Price) from the features
y = data['Price']
X = data.drop('Price', axis=1)
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Scale the data
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
# Create a Ridge model
ridge = Ridge(alpha=1.0)
# Perform cross-validation on the model
cv_scores = cross_val_score(ridge, X_train, y_train, cv=5)
# Train the model on the full training data
ridge.fit(X_train, y_train)
# Evaluate the model on the test data
score = ridge.score(X_test, y_test)
# Print the coefficients of the model, sorted by magnitude
coef = pd.DataFrame({'Feature': X.columns, 'Coefficient': ridge.coef_})
coef = coef.reindex(coef['Coefficient'].abs().sort_values(ascending=False).index)
print(coef)

# Print the cross-validation scores and the mean score
print("Cross-validation scores:", cv_scores)

# Perform hypothesis testing on the coefficients
X_train = sm.add_constant(X_train)
model = sm.OLS(y_train, X_train)
results = model.fit()
print("")
print("")

# Print the list of attributes that do not have a statistically significant impact on price
threshold = 0.05
insignificant_attrs = list(X.columns[results.pvalues[1:] > threshold])
if len(insignificant_attrs) > 0:
    print('The following attributes do not have a statistically significant impact on price:')
    print(insignificant_attrs)
else:
    print('All attributes have a statistically significant impact on price.')
The results above show the coefficient values ordered by their magnitude and give us potentially valuable insights into how each attribute contributes to the overall price. Firstly, we see that the largest contributors to the price of a particular location include the city it is situated in, the number of bedrooms, and the quantified attractiveness of each location. The data also demonstrates that consumers place little value on whether a location is near the metro, or on whether it is situated near the city center or on the outskirts. Furthermore, a housing location in Amsterdam, Paris, or Barcelona results in increased prices while Lisbon and Rome result in lower prices; whether or not an Airbnb is situated in Vienna has virtually no impact on the price.
To determine which attributes to remove, we analyze the coefficient results by performing hypothesis testing at a 95% confidence level, i.e., a significance threshold of 0.05. Thus, we can determine which of the lower-magnitude coefficients are statistically significant and which are not. The results above show us that the City Center, Metro Distance, Vienna, and Superhost_binary attributes are insignificant and thus should be removed from the dataset.
Another method of feature reduction that should be implemented is the LASSO method. While LASSO and ridge regression are both means of handling multicollinearity in features, LASSO is able to reduce the coefficient values of features all the way to zero. Thus, LASSO is able to select for us the features that it considers to be the most important, while outright rejecting features that do not contribute any new insights to the dataset.
import pandas as pd
import statsmodels.api as sm
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
Feature Coefficient
10 Amsterdam 67.906292
5 Bedrooms 45.278787
16 Paris 39.946372
8 Normalised Attraction Index 37.880234
11 Athens -37.693243
14 Budapest -35.397368
0 Person Capacity 28.531626
17 Rome -22.263410
24 entire_home_appt_rent 21.531718
2 Business 20.673031
12 Barcelona 19.289359
22 Private_Room_binary -19.276787
9 Normalised Restraunt Index 13.452512
21 Shared_Room_binary -12.781874
15 Lisbon -8.096214
1 Multiple Rooms 6.127616
3 Cleanliness Rating 5.475479
4 Guest Satisfaction 5.237409
13 Berlin 2.876418
20 Day_Weekend 2.204834
19 Day_Weekday -2.204834
7 Metro Distance (km) 0.575413
23 Superhost_binary 0.056350
6 City Center (km) -0.053507
18 Vienna 0.012645
Cross-validation scores: [0.26728968 0.18219008 0.40825498 0.33301679 0.30197632]
The following attributes do not have a statistically significant impact on price:
# Read in the data
data = df
# Separate the target variable (Price) from the features
y = data['Price']
X = data.drop('Price', axis=1)
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Scale the data
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
# Create a Lasso model
lasso = Lasso(alpha=1.0)
# Perform cross-validation on the model
cv_scores = cross_val_score(lasso, X_train, y_train, cv=5)
# Train the model on the full training data
lasso.fit(X_train, y_train)
# Evaluate the model on the test data
score = lasso.score(X_test, y_test)
# Print the coefficients of the model, sorted by magnitude
coef = pd.DataFrame({'Feature': X.columns, 'Coefficient': lasso.coef_})
coef = coef.reindex(coef['Coefficient'].abs().sort_values(ascending=False).index)
print(coef)

# Print the cross-validation scores and the mean score
print("Cross-validation scores:", cv_scores)
# Perform hypothesis testing on the coefficients
X_train = sm.add_constant(X_train)
X_test = sm.add_constant(X_test)
model = sm.OLS(y_train, X_train)
results = model.fit()
Feature Coefficient
10 Amsterdam 7.042547e+01
16 Paris 4.545510e+01
5 Bedrooms 4.435405e+01
8 Normalised Attraction Index 4.013301e+01
24 entire_home_appt_rent 3.912207e+01
11 Athens -3.036790e+01
14 Budapest -2.840826e+01
0 Person Capacity 2.809245e+01
12 Barcelona 2.189942e+01
2 Business 1.743443e+01
17 Rome -1.351047e+01
9 Normalised Restraunt Index 1.088857e+01
21 Shared_Room_binary -8.184567e+00
13 Berlin 5.480351e+00
3 Cleanliness Rating 4.837223e+00
4 Guest Satisfaction 4.294201e+00
18 Vienna 3.687602e+00
1 Multiple Rooms 3.290093e+00
19 Day_Weekday -3.233115e+00
20 Day_Weekend 4.360647e-16
6 City Center (km) 0.000000e+00
The results above show that the features with a coefficient near zero include Metro Distance, Superhost_binary, and City Center (just as for ridge regression). However, we also see that Lisbon is the city that LASSO requests we remove, rather than Vienna as in ridge regression. Furthermore, looking at the cross-validation scores, we see that LASSO performs similarly to ridge regression across the different randomized splits; therefore, cross-validation is not a differentiator. When looking at both LASSO and ridge regression, we see that Lisbon performs poorly in LASSO but well in ridge regression, while Vienna, even though it does not have a coefficient of zero, has a coefficient on the lower end of the LASSO model. Thus, Vienna has a relatively low influence on data variance in both models. This difference could be due to the fact that LASSO utilizes L1 regularization, which can force coefficients to exactly zero, while ridge regression uses L2 regularization, which penalizes the square of the coefficients and shrinks them without forcing them to zero. In essence, this suggests that we should eliminate Vienna, as both models have it on the lower end of the coefficient spectrum, while Lisbon should still be kept.
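The L1-versus-L2 contrast described above can be illustrated on a small synthetic example (not our Airbnb data): with uninformative features, Lasso drives their coefficients exactly to zero, while Ridge only shrinks them toward zero:

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
# Only the first feature drives the target; the other two are pure noise
y = 5 * X[:, 0] + rng.normal(scale=0.5, size=200)

ridge = Ridge(alpha=10.0).fit(X, y)
lasso = Lasso(alpha=1.0).fit(X, y)

# Ridge shrinks the noise coefficients but rarely reaches zero;
# Lasso's L1 penalty sets them exactly to zero
print("ridge:", ridge.coef_.round(4))
print("lasso:", lasso.coef_.round(4))
```

The alpha values here are illustrative; the qualitative behavior (exact zeros under L1, small nonzero values under L2) is what matters for feature selection.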
df = df.drop(['Metro Distance (km)'], axis=1)
df = df.drop(['City Center (km)'], axis=1)
df = df.drop(['Superhost_binary'], axis=1)
df = df.drop(['Vienna'], axis=1)
print(df.columns)
Determining/Performing Cross Products of Features
To determine the features that should be combined into a cross product, I have performed correlation analysis with the correlation matrix below. I then simplify the matrix to only give me the feature comparisons with high correlation. In statistics, this is commonly defined as features with a correlation of 0.7 or above.
import seaborn as sns
import matplotlib.pyplot as plt
corr_matrix = df.corr()
sns.heatmap(corr_matrix, cmap='coolwarm')
plt.show()
15 Lisbon -0.000000e+00
22 Private_Room_binary -0.000000e+00
23 Superhost_binary 0.000000e+00
7 Metro Distance (km) 0.000000e+00
Cross-validation scores: [0.26700137 0.1815365 0.40973316 0.33210292 0.302042 ]
Index(['Price', 'Person Capacity', 'Multiple Rooms', 'Business',
'Cleanliness Rating', 'Guest Satisfaction', 'Bedrooms',
'Normalised Attraction Index', 'Normalised Restraunt Index',
'Amsterdam', 'Athens', 'Barcelona', 'Berlin', 'Budapest', 'Lisbon',
'Paris', 'Rome', 'Day_Weekday', 'Day_Weekend', 'Shared_Room_binary',
'Private_Room_binary', 'entire_home_appt_rent'],
dtype='object')
# Find the attribute comparisons with correlations greater than 70%
high_corr_pairs = []
high_corr_values = []
for i in range(len(corr_matrix.columns)):
    for j in range(i):
        if abs(corr_matrix.iloc[i, j]) > 0.7 and i != j:
            pair = corr_matrix.columns[i] + ' - ' + corr_matrix.columns[j]
            corr_value = corr_matrix.iloc[i, j]
            high_corr_pairs.append(pair)
            high_corr_values.append(corr_value)

# Print the attribute comparisons with correlations greater than 70% and their correlation values
if high_corr_pairs:
    print('Attribute comparisons with correlations greater than 70%:')
    for i in range(len(high_corr_pairs)):
        print(high_corr_pairs[i] + ': ' + str(high_corr_values[i]))
else:
    print('No attribute comparisons with correlations greater than 70%.')
Attribute comparisons with correlations greater than 70%:
Normalised Restraunt Index - Normalised Attraction Index: 0.7013877285762937
Day_Weekend - Day_Weekday: -1.0000000000000002
entire_home_appt_rent - Private_Room_binary: -0.9827047693020984
The results above show that, excluding comparisons of an attribute with itself, the listed attribute pairs have a statistically high correlation. However, it is also important to understand the context of these correlations. For instance, we would expect a perfect (negative) correlation between weekends and weekdays, as they are opposite binary values in an either-or situation. Furthermore, an Airbnb that is offering an entire home will almost never be a private room, unless it is a one-room home. Therefore, the only comparison to take seriously is the restaurant and attraction index pair.
Combining the restaurant index and the attraction index can provide better insights, as it is possible that the two features together can better explain the target variable (price) than if they remain separate. This makes sense, as people may often choose a home, apartment, or hotel based on its location relative to major tourist attractions as well as to good places to eat.
# Cross-product combine Normalised Restraunt Index and Normalised Attraction Index
df['Restraunt_Attraction'] = df['Normalised Restraunt Index'] * df['Normalised Attraction Index']
# Creating other cross columns with other highly correlated features
df['Clean_Satisfaction'] = df['Cleanliness Rating'] * df['Guest Satisfaction']
df['Bedrooms_Capacity'] = df['Bedrooms'] * df['Person Capacity']
# Print the updated data with the new feature
df.head()
df= df.drop(['Normalised Attraction Index'], axis=1)
df= df.drop(['Normalised Restraunt Index'], axis=1)
num_rows = len(df)
num_rows
print(df.columns)
The Final List of Attributes:
from PIL import Image
from IPython.display import display
41714
Index(['Price', 'Person Capacity', 'Multiple Rooms', 'Business',
'Cleanliness Rating', 'Guest Satisfaction', 'Bedrooms', 'Amsterdam',
'Athens', 'Barcelona', 'Berlin', 'Budapest', 'Lisbon', 'Paris', 'Rome',
'Day_Weekday', 'Day_Weekend', 'Shared_Room_binary',
'Private_Room_binary', 'entire_home_appt_rent', 'Restraunt_Attraction',
'Clean_Satisfaction', 'Bedrooms_Capacity'],
dtype='object')
# Open the image
image = Image.open("final_attributes.png")
# Display the image
display(image)
Metrics to Evaluate Algorithm Performance:
The metrics that I choose to evaluate depend on the context of the problem. In this instance, our goal is not to classify a type of Airbnb using binary or multiple classes, but rather to predict the price. Thus, this problem would employ a regression-based algorithm that can be measured using metrics like mean absolute error (MAE), mean absolute percentage error (MAPE), and root mean square error (RMSE).
Considering MAPE, it is certainly beneficial from an analysis perspective, as it makes it easier to quantify relative performance and understand the context of the differences between estimated and actual prices. For instance, if the price of one Airbnb is $100 and the estimate is $200, the percentage error is quantified as 100%. However, if one Airbnb is $2,300 and the estimate is $2,200, the difference in price is still $100, but MAPE properly quantifies the relative difference as just 4.35%, which is far better performance. However, MAPE has the crippling downside of being sensitive to outliers: if one or a few homes in a particular city happen to be far more expensive than the norm, they will skew the overall MAPE more than other metrics would.
Comparing RMSE, MSE, and MAE, we see that MAE is both an easier metric to understand and a metric that does not square the errors, which makes it even less susceptible to outliers. Thus, we choose MAE as our metric.
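The toy comparison above can be checked directly with scikit-learn's metric functions (the prices are the illustrative values from the text, not real listings):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_absolute_percentage_error

# The two cases from the text: a $100 miss on a cheap listing vs. on an expensive one
y_true = np.array([100.0, 2300.0])
y_pred = np.array([200.0, 2200.0])

# MAE treats both errors identically: (100 + 100) / 2 = 100
print(mean_absolute_error(y_true, y_pred))  # 100.0

# MAPE scales each error by the true price: (100/100 + 100/2300) / 2 ~ 0.522,
# dominated by the cheap listing's 100% error
print(mean_absolute_percentage_error(y_true, y_pred))
```

Note how a single large relative error (the cheap listing) dominates the MAPE, which is exactly the outlier sensitivity discussed above.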
Method of Data Division for Training and Testing:
The choice of method for dividing the data depends on the specific problem being solved and the nature of the data. It is important to choose a method that takes into account the relationships between the attributes and the target variable. Stratified and group K-fold cross-validation techniques are potential options. A shuffle split, while good for a very large number of rows, may not work well for our dataset, as there are a significant number of groupings that have to be considered, like the city location, the location of an Airbnb relative to attractions and restaurants, its relative cleanliness ranking, etc.
Comparing a group K-fold and a stratified K-fold, we see that the stratified fold will split so as to ensure that each subset has an equivalent distribution of Airbnb prices. On the other hand, a group K-fold will split based on a single grouping variable (like the city) to ensure that samples from the same group do not appear in both the training and testing sets. Considering that we are trying to determine the price based on multiple attribute characteristics, group K-folds would not be ideal, as it forces us to pick a single attribute to group by. Furthermore, we have also found that stratified K-fold cross-validation can only be performed on categorical targets (i.e., we would have to define which prices count as "expensive", "medium", or "low"). Since we want to predict the numerical price of each home rather than a category, we use a standard K-fold.
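As a minimal sketch of the chosen strategy, a plain K-fold split works directly on a continuous price target, with no need to bin prices into categories or pick a grouping variable (the arrays here are illustrative):

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy feature matrix and continuous price target
X = np.arange(20).reshape(10, 2)
y = np.linspace(100, 500, 10)

# Plain K-fold: each of the 5 folds uses 8 rows for training and 2 for testing
kf = KFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, test_idx in kf.split(X):
    print(len(train_idx), len(test_idx))  # 8 2
```

StratifiedKFold would raise an error on a continuous `y` like this one, which is the constraint noted above.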
Modeling: Combined Wide and Deep Branch Network:
Below, we define a regression model using the Keras library and TensorFlow. The model uses a combination of wide and deep networks: the wide branch has a single layer with 20 units, and the two deep branches have two and three layers, with 10 and 5 units, and with 15, 10, and 5 units, respectively. The model is compiled using the mean absolute error (MAE) as the loss function and stochastic gradient descent (SGD) as the optimizer. The model is trained using K-fold cross-validation and evaluated using the MAE and mean absolute percentage error (MAPE) metrics.
One important thing that we learned when testing our model is that the activation function for the output layer 'predictions' must be 'relu' rather than a plain linear activation. When trying a linear output for our prediction, we found that the model performed extremely poorly, with an MAE above 250 and a MAPE near 100%. This was because a linear output over the merged features was too simplistic to capture the more sophisticated relationships in the data.
# Derived from https://github.com/eclarson/MachineLearningNotebooks/blob/master/10.%20Keras%20Wide%20and%20D
from sklearn import metrics as mt
import tensorflow as tf
from tensorflow import keras
import os
import numpy as np

os.environ['AUTOGRAPH_VERBOSITY'] = '0'

from tensorflow.keras.layers import Dense, Activation, Input, concatenate
from tensorflow.keras.models import Model
from tensorflow.keras.wrappers.scikit_learn import KerasRegressor
from sklearn.model_selection import KFold, cross_val_score
# Define the input size
num_features = X_train.shape[1]
input_tensor = Input(shape=(num_features,))
#Creating the 3 combined wide/deep networks
# Define the wide branch
wide_branch = Dense(units=20, activation='relu')(input_tensor)
# Define the first deep branch
deep_branch_1 = Dense(units=10, activation='relu')(input_tensor)
deep_branch_1 = Dense(units=5, activation='relu')(deep_branch_1)
# Define the second deep branch
deep_branch_2 = Dense(units=15, activation='relu')(input_tensor)
deep_branch_2 = Dense(units=10, activation='relu')(deep_branch_2)
deep_branch_2 = Dense(units=5, activation='relu')(deep_branch_2)
# Merge the branches together
merged = concatenate([wide_branch, deep_branch_1, deep_branch_2])
# Define the output layer
predictions = Dense(1, activation='relu')(merged)
# Define a function that returns the compiled model
def create_model():
    model = Model(inputs=input_tensor, outputs=predictions)
    model.compile(optimizer='sgd', loss='mae')
    return model
# Wrap the Keras model in a scikit-learn estimator
estimator = KerasRegressor(build_fn=create_model, epochs=10, batch_size=50, verbose=1)
# Define k-fold cross-validation
kf = KFold(n_splits=3)
# Train and evaluate the model using k-fold cross-validation
mae_scores = -1 * cross_val_score(estimator, X_train, y_train, cv=kf, scoring='neg_mean_absolute_error')
mape_scores = -1 * cross_val_score(estimator, X_train, y_train, cv=kf, scoring='neg_mean_absolute_percentage_error')
# Fit the model on the full training set
estimator.fit(X_train, y_train)
# Evaluate the model on the test set
yhat = estimator.predict(X_test)
print('Mean Absolute Error:', mt.mean_absolute_error(y_test, yhat))
print('Mean Absolute Percentage Error:', mt.mean_absolute_percentage_error(y_test, yhat))
print('Average MAE:', np.mean(mae_scores))
print('Average MAPE:', np.mean(mape_scores))
Mean Absolute Error: 63.719224877474
Mean Absolute Percentage Error: 0.21056294795070363
Average MAE: 64.8300234832439
Average MAPE: 0.2235887023849258
Visualization of Network Performance:
445/445 [==============================] - 1s 2ms/step - loss: 63.3045
Epoch 8/10
445/445 [==============================] - 1s 2ms/step - loss: 63.2145
Epoch 9/10
445/445 [==============================] - 1s 2ms/step - loss: 63.0805
Epoch 10/10
445/445 [==============================] - 1s 2ms/step - loss: 63.1304
223/223 [==============================] - 0s 1ms/step
Epoch 1/10
668/668 [==============================] - 2s 2ms/step - loss: 62.9419
Epoch 2/10
668/668 [==============================] - 1s 2ms/step - loss: 62.8623
Epoch 3/10
668/668 [==============================] - 1s 2ms/step - loss: 62.7756
Epoch 4/10
668/668 [==============================] - 1s 2ms/step - loss: 62.7593
Epoch 5/10
668/668 [==============================] - 1s 2ms/step - loss: 62.7627
Epoch 6/10
668/668 [==============================] - 1s 2ms/step - loss: 62.7029
Epoch 7/10
668/668 [==============================] - 1s 2ms/step - loss: 62.5457
Epoch 8/10
668/668 [==============================] - 1s 2ms/step - loss: 62.5283
Epoch 9/10
668/668 [==============================] - 1s 2ms/step - loss: 62.4394
Epoch 10/10
668/668 [==============================] - 1s 2ms/step - loss: 62.5004
167/167 [==============================] - 0s 1ms/step
Mean Absolute Error: 63.719224877474
The graph below depicts the training and validation loss over the epochs. To relate the graph to the performance of the model, we look at whether the training and validation loss are decreasing over time, which would mean that the error of the model is shrinking and that its performance is improving. Below, we find that while the training loss has decreased over time, the validation loss has been unpredictable and its net improvement has been near zero. This suggests that the model is possibly overfitted. One potential solution to this issue is reducing the number of deep layers to prevent the model from becoming overly reliant on the patterns in the training data.
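The check described above can be sketched numerically; the loss values below are hypothetical, shaped like the curves discussed (training loss falling steadily, validation loss wandering with near-zero net improvement):

```python
import numpy as np

# Hypothetical per-epoch losses (not the actual run above)
train_loss = np.array([63.3, 63.2, 63.1, 62.9, 62.8, 62.7, 62.6, 62.5, 62.45, 62.4])
val_loss = np.array([64.1, 63.8, 64.3, 63.9, 64.4, 63.7, 64.2, 64.0, 64.3, 64.1])

# Net improvement over training: first-epoch loss minus last-epoch loss
train_improvement = train_loss[0] - train_loss[-1]
val_improvement = val_loss[0] - val_loss[-1]

# Training improving while validation stalls is one simple overfitting signal
print(round(train_improvement, 2), round(val_improvement, 2))  # 0.9 0.0
```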
# Fit the model on the full training set and save the training and validation loss values after each epoch
train_loss = []
val_loss = []
for epoch in range(10):
    history = estimator.fit(X_train, y_train, epochs=1, batch_size=50, verbose=0, validation_data=(X_test, y_test))
    train_loss.append(history.history['loss'][0])
    val_loss.append(history.history['val_loss'][0])
# Visualize the training and validation loss over time
plt.plot(train_loss, label='Training Loss')
plt.plot(val_loss, label='Validation Loss')
plt.title('Training and Validation Loss over Time')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.show()
Investigating Generalization Performance:
I modified the code so that all the layers in each of the two deep branches have 50 units. I also removed all but two deep branches and a wide branch, deleting the additional layers that the deep branches previously had. I then reduced the number of neurons in each layer to 5, again to try to prevent overfitting. The results of the graph show that in this situation, the training loss now exceeds the validation loss, and both the training and validation loss are still relatively high at the end. This suggests that there could be underfitting now, as the model is unable to learn patterns from the training data that generalize to the test set. The actual results show that the performance of the model has slightly decreased, with an MAE that is roughly an additional $2 per home.
# Derived from https://github.com/eclarson/MachineLearningNotebooks/blob/master/10.%20Keras%20Wide%20and%20D
from sklearn import metrics as mt
import tensorflow as tf
from tensorflow import keras
import os
import numpy as np
os.environ['AUTOGRAPH_VERBOSITY'] = '0'
from tensorflow.keras.layers import Dense, Activation, Input, concatenate
from tensorflow.keras.models import Model
from tensorflow.keras.wrappers.scikit_learn import KerasRegressor
from sklearn.model_selection import KFold, cross_val_score
# Define the input size
num_features = X_train.shape[1]
input_tensor = Input(shape=(num_features,))
#Creating the 3 combined wide/deep networks
# Define the wide branch
wide_branch = Dense(units=5, activation='relu')(input_tensor)
# Define the first deep branch
deep_branch_1 = Dense(units=5, activation='relu')(input_tensor)
# Define the second deep branch
deep_branch_2 = Dense(units=5, activation='relu')(input_tensor)
# Merge the branches together
merged = concatenate([wide_branch, deep_branch_1, deep_branch_2])
# Define the output layer
predictions = Dense(1, activation='relu')(merged)
# Define a function that returns the compiled model
def create_model():
    model = Model(inputs=input_tensor, outputs=predictions)
    model.compile(optimizer='sgd', loss='mae')
    return model
# Wrap the Keras model in a scikit-learn estimator
estimator = KerasRegressor(build_fn=create_model, epochs=10, batch_size=50, verbose=1)
# Define k-fold cross-validation
kf = KFold(n_splits=3)
# Train and evaluate the model using k-fold cross-validation
mae_scores = -1 * cross_val_score(estimator, X_train, y_train, cv=kf, scoring='neg_mean_absolute_error')
mape_scores = -1 * cross_val_score(estimator, X_train, y_train, cv=kf, scoring='neg_mean_absolute_percentage_error')
# Fit the model on the full training set
estimator.fit(X_train, y_train)
# Evaluate the model on the test set
yhat = estimator.predict(X_test)
print('Mean Absolute Error:', mt.mean_absolute_error(y_test, yhat))
print('Mean Absolute Percentage Error:', mt.mean_absolute_percentage_error(y_test, yhat))
print('Average MAE:', np.mean(mae_scores))
print('Average MAPE:', np.mean(mape_scores))
Mean Absolute Error: 65.06654545152466
Mean Absolute Percentage Error: 0.21782960410954044
Average MAE: 67.38108518929515
Average MAPE: 0.2211146833578701
# Fit the model on the full training set and save the training and validation loss values after each epoch
train_loss = []
val_loss = []
for epoch in range(10):
    history = estimator.fit(X_train, y_train, epochs=1, batch_size=50, verbose=0, validation_data=(X_test, y_test))
    train_loss.append(history.history['loss'][0])
    val_loss.append(history.history['val_loss'][0])
# Visualize the training and validation loss over time
plt.plot(train_loss, label='Training Loss')
plt.plot(val_loss, label='Validation Loss')
plt.title('Training and Validation Loss over Time')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.show()
Epoch 9/10
445/445 [==============================] - 1s 2ms/step - loss: 65.9345
Epoch 10/10
445/445 [==============================] - 1s 2ms/step - loss: 65.8984
223/223 [==============================] - 0s 1ms/step
Epoch 1/10
668/668 [==============================] - 2s 2ms/step - loss: 65.6138
...
Epoch 10/10
668/668 [==============================] - 1s 2ms/step - loss: 65.4271
167/167 [==============================] - 0s 1ms/step
In the code below, we attempt to solve the issues in the above models through the use of cross-validation on the entire model, instead of solely using it on the validation set, as an attempt at reducing the potential for overfitting. Furthermore, we use dropout layers to drop 20% of the units in each layer on each iteration, in an attempt to make the model more robust, with less unpredictable swings in the validation or training loss. The results show that the validation loss varies from $62.5 to $65 despite the overall consistent improvement in the training loss. This suggests that the current model may not be improvable beyond this level of error. To investigate further, I will look at the distribution of the errors next.
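Mechanically, the dropout used below zeroes a random fraction of a layer's unit activations on each training step; here is a minimal numpy sketch of the idea (not the actual Keras internals):

```python
import numpy as np

rng = np.random.default_rng(0)
activations = np.ones((4, 25))  # a batch of 4 examples through a 25-unit layer
rate = 0.2                      # fraction of units to drop, matching Dropout(0.2)

# Randomly keep ~80% of the units; scale survivors so the expected sum is unchanged
mask = rng.random(activations.shape) >= rate
dropped = activations * mask / (1.0 - rate)

print(dropped.shape)  # (4, 25); entries are either 0.0 or 1.25
```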
from sklearn import metrics as mt
import tensorflow as tf
from tensorflow import keras
import os
import numpy as np
from tensorflow.keras.optimizers import SGD
os.environ['AUTOGRAPH_VERBOSITY'] = '0'
from tensorflow.keras.layers import Dense, Activation, Input, concatenate, Dropout
from tensorflow.keras.models import Model
from tensorflow.keras.wrappers.scikit_learn import KerasRegressor
from tensorflow.keras.callbacks import EarlyStopping
from sklearn.model_selection import KFold, cross_val_score
# Define the input size
num_features = X_train.shape[1]
input_tensor = Input(shape=(num_features,))
#Creating the 3 combined wide/deep networks
# Define the wide branch
wide_branch = Dense(units=20, activation='relu')(input_tensor)
# Define the first deep branch
deep_branch_1 = Dense(units=25, activation='relu')(input_tensor)
deep_branch_1 = Dropout(0.2)(deep_branch_1)
deep_branch_1 = Dense(units=25, activation='relu')(deep_branch_1)
deep_branch_1 = Dropout(0.2)(deep_branch_1)
deep_branch_1 = Dense(units=25, activation='relu')(deep_branch_1)
deep_branch_1 = Dropout(0.2)(deep_branch_1)
deep_branch_1 = Dense(units=25, activation='relu')(deep_branch_1)
# Define the second deep branch
deep_branch_2 = Dense(units=25, activation='relu')(input_tensor)
deep_branch_2 = Dropout(0.2)(deep_branch_2)
deep_branch_2 = Dense(units=25, activation='relu')(deep_branch_2)
deep_branch_2 = Dropout(0.2)(deep_branch_2)
deep_branch_2 = Dense(units=25, activation='relu')(deep_branch_2)
deep_branch_2 = Dropout(0.2)(deep_branch_2)
deep_branch_2 = Dense(units=25, activation='relu')(deep_branch_2)
# Merge the branches together
merged = concatenate([wide_branch, deep_branch_1, deep_branch_2])
# Define the output layer
predictions = Dense(1, activation='relu')(merged)
# Define a function that returns the compiled model
def create_model():
    model = Model(inputs=input_tensor, outputs=predictions)
    optimizer = SGD(learning_rate=0.1)
    model.compile(optimizer=optimizer, loss='mae')
    return model
# # Define a custom callback function for early stopping
# class EarlyStopCallback(tf.keras.callbacks.Callback):
# def on_epoch_end(self, epoch, logs={}):
# val_loss = logs.get('val_loss')
# if val_loss is not None and val_loss < 0.1: # Set the threshold for early stopping here
# self.model.stop_training = True
# Wrap the Keras model in a scikit-learn estimator
estimator = KerasRegressor(build_fn=create_model, epochs=100, batch_size=50, verbose=1)
#100 epochs is ok be
estimator.fit(X_train, y_train)
yhat_test = estimator.predict(X_test)
mae_test = mt.mean_absolute_error(y_test, yhat_test)
mape_test = mt.mean_absolute_percentage_error(y_test, yhat_test)
print('Mean Absolute Error on test set:', mae_test)
print('Mean Absolute Percentage Error on test set:', mape_test)
Mean Absolute Error on test set: 64.0529991426325
Mean Absolute Percentage Error on test set: 0.21629698883546936
# Fit the model on the full training set and save the training and validation loss values after each epoch
train_loss = []
val_loss = []
for epoch in range(100):
    history = estimator.fit(X_train, y_train, epochs=1, batch_size=50, verbose=0, validation_data=(X_test, y_test))
    train_loss.append(history.history['loss'][0])
    val_loss.append(history.history['val_loss'][0])
# Visualize the training and validation loss over time
plt.plot(train_loss, label='Training Loss')
plt.plot(val_loss, label='Validation Loss')
plt.title('Training and Validation Loss over Time')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.show()
Epoch 91/100
668/668 [==============================] - 3s 4ms/step - loss: 64.6953
...
Epoch 99/100
668/668 [==============================] - 3s 5ms/step - loss: 64.6091
Below, we see a histogram of the distribution of percentage errors between the actual and predicted values from the test set. We see that the errors are normally distributed, which suggests the model is not making a particular systematic mistake (consistently overestimating or underestimating the prices of homes). This suggests that the model is performing reasonably well in regard to capturing the types of variation present in the data; there is no bias or learning problem creating underfitting or overfitting. However, it does appear that there is a tendency to overestimate the values of some homes, more so than to underestimate them, as shown by the right tail in the histogram and the few higher residual dots in the residual plot. This is something we can explore to see if it is a problem.
import matplotlib.pyplot as plt
import numpy as np
# Calculate percentage errors
pct_errors = 100 * (yhat_test - y_test) / y_test
# Calculate mean and standard deviation of percentage errors
mean = np.mean(pct_errors)
std = np.std(pct_errors)
# Plot histogram of percentage errors
plt.hist(pct_errors, bins=20)
# Add normal distribution overlay
x = np.linspace(np.min(pct_errors), np.max(pct_errors), 100)
y = 1 / (std * np.sqrt(2 * np.pi)) * np.exp(-0.5 * ((x - mean) / std) ** 2)
plt.plot(x, y * len(pct_errors) * np.diff(plt.hist(pct_errors, bins=20)[1])[0], 'r-', lw=2)
# Add labels and title
plt.xlabel('Percentage Error')
plt.ylabel('Frequency')
plt.title('Histogram of Percentage Errors with Normal Distribution Overlay')
plt.show()
import matplotlib.pyplot as plt
# Predict on test set
y_pred = estimator.predict(X_test)
# Calculate residuals
residuals = y_test - y_pred
# Create residual plot
plt.scatter(y_pred, residuals)
plt.xlabel('Predicted Values')
plt.ylabel('Residuals')
plt.title('Residual Plot')
plt.show()
167/167 [==============================] - 0s 2ms/step
ITOM3355 Fall 22 / Labs_Lawrence_and_Matthew (Published Apr 16, 2023)
Below we see that only four homes have greater than a 100% error, which is not indicative of a widespread overprediction of certain types of Airbnbs that would cause our overall error to increase significantly. Thus, the few high outliers are not an issue.
import matplotlib.pyplot as plt
# Predict on test set
y_pred = estimator.predict(X_test)
# Calculate residuals
residuals = y_test - y_pred
# Calculate percentage error
pct_error = 100 * np.abs(residuals / y_test)
# Set threshold for percentage error
threshold = 200
high_error_indices = np.where((pct_error > threshold) & (y_pred > y_test))[0]
num_high_error_rows = len(high_error_indices)
print(f"Number of rows with percentage error greater than {threshold-100}% and predicted price > actual price: {num_high_error_rows}")
# Get indices of rows with high percentage error and predicted price > actual price
high_error_indices = np.where((pct_error > threshold) & (y_pred > y_test))[0]
# Print columns of rows with high percentage error and predicted price > actual price
for index in high_error_indices:
    if index in pct_error.index:
        row = X_test[index]
        actual_price = y_test[index]
        predicted_price = y_pred[index]
        error = pct_error[index]
        print(f"Index: {index}, Actual Price: {actual_price}, Predicted Price: {predicted_price}")
        print()
167/167 [==============================] - 0s 1ms/step
Number of rows with percentage error greater than 100% and predicted price > actual price: 4
Standard Multi-Layer Perceptron:
To modify the code to be a standard multi-layer perceptron, we can remove the wide and deep branches and the merge layer, and instead use a simple series of dense layers with activation functions. The results below show that the multi-layer perceptron performs considerably worse than the wide and deep networks in terms of the mean absolute error, which is 9.38% greater than that of our best-performing wide and deep network. This decrease in performance could be explained by the fact that an MLP passes every feature through the same stack of layers and thus cannot account as well for both the simple feature relationships and the complex patterns simultaneously.
from sklearn.neural_network import MLPRegressor
from sklearn import metrics as mt
from sklearn.model_selection import KFold, cross_val_score
# Define the input size
num_features = X_train.shape[1]
# Define the MLP model
model = MLPRegressor(hidden_layer_sizes=(5,5), activation='relu', solver='adam', alpha=0.0001, batch_size='auto',
                     learning_rate='constant', learning_rate_init=0.001, power_t=0.5, max_iter=1000, shuffle=True,
                     random_state=None, tol=0.0001, verbose=False, warm_start=False, momentum=0.9, nesterovs_momentum=True,
                     early_stopping=False, validation_fraction=0.1, beta_1=0.9, beta_2=0.999, epsilon=1e-08,
                     max_fun=15000)
# Define k-fold cross-validation
kf = KFold(n_splits=3)
# Train and evaluate the model using k-fold cross-validation
mae_scores = -1 * cross_val_score(model, X_train, y_train, cv=kf, scoring='neg_mean_absolute_error')
mape_scores = -1 * cross_val_score(model, X_train, y_train, cv=kf, scoring='neg_mean_absolute_percentage_error')
# Fit the model on the full training set
model.fit(X_train, y_train)
# Evaluate the model on the test set
yhat = model.predict(X_test)
print('Mean Absolute Error:', mt.mean_absolute_error(y_test, yhat))
print('Mean Absolute Percentage Error:', mt.mean_absolute_percentage_error(y_test, yhat))
print('Average MAE:', np.mean(mae_scores))
print('Average MAPE:', np.mean(mape_scores))
Results Summary
Our first wide and deep network has a wide branch, a deep branch with two layers, and a deep branch with four layers. After processing, it reached a Mean Absolute Error of 64.83 and a Mean Absolute Percentage Error of 21.05%.
Index: 6612, Actual Price: 150.6807583, Predicted Price: 477.8066711425781
/shared-libs/python3.9/py/lib/python3.9/site-packages/sklearn/neural_network/_multilayer_perceptron.py:702: ConvergenceWarning
  warnings.warn(
Mean Absolute Error: 71.19445245875352
Mean Absolute Percentage Error: 0.27446299072495883
Average MAE: 72.67603226747723
Average MAPE: 0.28624278116938223
Our second wide and deep network has a wide branch and two deep branches with four large layers in each branch. After processing, it reached a Mean Absolute Error of 65.34 and a Mean Absolute Percentage Error of 23.63%.

Our third wide and deep network has a wide branch and two deep branches with one small layer in each branch. After processing, it reached a Mean Absolute Error of 64.81 and a Mean Absolute Percentage Error of 21.76%.

The Multilayer Perceptron reached a Mean Absolute Error of 71.19 and a Mean Absolute Percentage Error of 27.45%.

In conclusion, all three wide and deep networks perform significantly better than the multilayer perceptron, with minimal difference between the three networks (the second network performed slightly worse than the other two, which was surprising since it had the largest dense layers).
Exceptional Work: Comparing the Multi-Layer Perceptron vs. Deep and Wide using Bland-Altman
Below, we see a performance comparison between a deep and wide neural network and a Multi-Layer Perceptron. In this instance, we are looking at each model's ability to predict the target variable by examining the graphical relationship between the residuals and the mean of the predictions (represented by the horizontal line). The results below show that while the majority of the residuals fall within the 95% limits of agreement, there are residuals that fall outside the bounds, with the higher ones being especially noticeable, which suggests that both models make a greater number of errors by overestimating than by underestimating the price. Furthermore, we see that the MLP also makes the mistake of underestimating the price, while the Deep and Wide Network does not make this particular error, suggesting that the Deep and Wide Network is better able to understand the attributes that lead to a home having a higher price.
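The limits of agreement drawn in the plots below come down to simple arithmetic on the residuals: their mean plus or minus 1.96 standard deviations. A toy illustration (the residuals here are made up):

```python
import numpy as np

residuals = np.array([5.0, -3.0, 2.0, -4.0, 0.0])  # toy actual-minus-predicted errors

ba_mean = residuals.mean()
ba_std = residuals.std()
# Under a normality assumption, ~95% of residuals should fall inside these limits
limits_of_agreement = (ba_mean - 1.96 * ba_std, ba_mean + 1.96 * ba_std)

print(ba_mean)  # 0.0
print(limits_of_agreement)
```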
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt
import numpy as np
# Define MLP model
mlp = MLPRegressor(hidden_layer_sizes=(5,5), activation='relu', solver='adam', max_iter=1000, random_state=4)
# Train MLP model
mlp.fit(X_train, y_train)
# Evaluate MLP model on test set
y_pred_mlp = mlp.predict(X_test)
mse_mlp = mean_squared_error(y_test, y_pred_mlp)
# Define D&W model
input_tensor = Input(shape=(num_features,))
wide_branch = Dense(units=5, activation='relu')(input_tensor)
deep_branch_1 = Dense(units=5, activation='relu')(input_tensor)
deep_branch_2 = Dense(units=5, activation='relu')(input_tensor)
merged = concatenate([wide_branch, deep_branch_1, deep_branch_2])
predictions = Dense(1, activation='relu')(merged)
model_dw = Model(inputs=input_tensor, outputs=predictions)
model_dw.compile(optimizer='adam', loss='mae')
# Train D&W model
model_dw.fit(X_train, y_train, epochs=10, batch_size=50, verbose=0)
# Evaluate D&W model on test set
y_pred_dw = model_dw.predict(X_test)
mse_dw = mean_squared_error(y_test, y_pred_dw)
# Calculate residuals and mean of predictions for both models
residuals_mlp = y_test.ravel() - y_pred_mlp.ravel()
mean_predictions_mlp = (y_test.ravel() + y_pred_mlp.ravel()) / 2
residuals_dw = y_test.ravel() - y_pred_dw.ravel()
mean_predictions_dw = (y_test.ravel() + y_pred_dw.ravel()) / 2
# Calculate Bland-Altman plot statistics for MLP model
ba_mean_mlp = np.mean(residuals_mlp)
ba_std_mlp = np.std(residuals_mlp)
limits_of_agreement_mlp = [ba_mean_mlp - 1.96 * ba_std_mlp, ba_mean_mlp + 1.96 * ba_std_mlp]
# Calculate Bland-Altman plot statistics for D&W model
ba_mean_dw = np.mean(residuals_dw)
ba_std_dw = np.std(residuals_dw)
limits_of_agreement_dw = [ba_mean_dw - 1.96 * ba_std_dw, ba_mean_dw + 1.96 * ba_std_dw]
# Plot Bland-Altman plots for both models
fig, axs = plt.subplots(ncols=2, figsize=(12,6))
axs[0].scatter(mean_predictions_mlp, residuals_mlp, alpha=0.5)
axs[0].axhline(y=ba_mean_mlp, color='black', linestyle='--')
axs[0].axhline(y=limits_of_agreement_mlp[0], color='red', linestyle='--')
axs[0].axhline(y=limits_of_agreement_mlp[1], color='red', linestyle='--')
axs[0].set_xlabel('Mean of predictions')
axs[0].set_ylabel('Residuals')
axs[0].set_title('Bland-Altman plot for MLP')
axs[1].scatter(mean_predictions_dw, residuals_dw, alpha=0.5)
axs[1].axhline(y=ba_mean_dw, color='black', linestyle='--')
axs[1].axhline(y=limits_of_agreement_dw[0], color='red', linestyle='--')
axs[1].axhline(y=limits_of_agreement_dw[1], color='red', linestyle='--')
Capturing the Embedding Weights
261/261 [==============================] - 0s 1ms/step
<matplotlib.lines.Line2D at 0x7f114fc8c850>
The purpose of plotting the embedding weights with respect to the target variable is to see how the embeddings are distributed across different percentile ranges, and whether there are any discernible patterns or clusters based on these ranges. In the code below, we apply two principal components to the data so that we can visualize them in a 2-dimensional plot. The scatterplot does show some separation between the Airbnb homes above the 75th percentile and those in the 25th and even 50th percentiles. However, we also see a lot of overlap, suggesting that the model is not clearly predicting the price, or that homes with similar or identical attributes still, for some reason, have varying prices. Thus, it is possible that there are additional attributes needed to explain the Airbnb prices that we do not have in the dataset. For example, one contributor to Airbnb prices is how far in advance the stay is reserved: is it just 3 days in advance, or a few weeks prior? Just like airline or concert tickets, ordering in advance can make the price cheaper.

Below the graph, I also printed the explained variance ratio for both of the components. We see that the first component explains 54% of the variance and the second principal component around 30%, for a total near 85%, which is good. However, if we wanted to further increase the total explained variance, it would be a good idea to add an additional dimension and create a 3D plot.
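The suggestion above can be checked on any matrix: explained variance ratios are cumulative over components, so a third component can only raise the total. A sketch on hypothetical data (random numbers standing in for the real embeddings):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
emb = rng.normal(size=(100, 5))  # hypothetical stand-in for the embedding outputs

# Sum of top-k explained variance ratios is non-decreasing in k
total_2d = PCA(n_components=2).fit(emb).explained_variance_ratio_.sum()
total_3d = PCA(n_components=3).fit(emb).explained_variance_ratio_.sum()

print(total_3d >= total_2d)  # True: the extra dimension never loses variance
```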
from sklearn.decomposition import PCA
# Define the names of the embedding layers
embed_layer_names = ['dense_111', 'dense_112']
# Define a new model that outputs the embeddings of the given layers
embed_model = Model(inputs=model_dw.input, outputs=[model_dw.get_layer(name).output for name in embed_layer_names])
# Use the new model to get embeddings for the data
embeddings = embed_model.predict(X_train)
# Use PCA to reduce the dimensionality of the embeddings to 2D
pca = PCA(n_components=2)
embeddings_2d = pca.fit_transform(embeddings[0])
# Find the percentiles of the target variable
percentiles = np.percentile(y_train, [25, 50, 75, 100])
# Create a new feature that indicates which percentile range each data point belongs to
percentile_range = []
y_train = y_train.reset_index(drop=True)
for i in range(len(y_train)):
    if y_train[i] <= percentiles[0]:
        percentile_range.append('25th percentile or lower')
    elif y_train[i] > percentiles[0] and y_train[i] <= percentiles[1]:
        percentile_range.append('Between 25th and 50th percentile')
    elif y_train[i] > percentiles[1] and y_train[i] <= percentiles[2]:
        percentile_range.append('Between 50th and 75th percentile')
    else:
        percentile_range.append('Above 75th percentile')
# Define a dictionary of colors for each category
category_colors = {
'25th percentile or lower': 'red',
'Between 25th and 50th percentile': 'orange',
'Between 50th and 75th percentile': 'green',
'Above 75th percentile': 'blue'
}
# Convert the categories into colors using the dictionary
colors = [category_colors[category] for category in percentile_range]
# Plot the embeddings in 2D with colors based on categories
plt.scatter(embeddings_2d[:,0], embeddings_2d[:,1], c=colors)
plt.show()
print(pca.explained_variance_ratio_)
1043/1043 [==============================] - 2s 2ms/step
[0.5487909 0.30481976]